Energy consumption is a critical concern worldwide due to its impact on the environment, economy, and human welfare. Therefore, understanding the factors that influence energy consumption in buildings is essential to optimize energy use and minimize its negative effects. Multiple linear regression is a statistical method used to model the relationship between a dependent variable and several independent variables simultaneously. In this report, we perform a multiple linear regression analysis to investigate the factors that affect energy consumption. The analysis is based on a dataset that includes information on natural gas consumption and several variables related to weather conditions (such as the mean external temperature and the irradiance). The objective of this study is to identify the significant predictors of energy consumption and provide insights into the underlying mechanisms that drive energy use.
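The report is organized into four parts: a description of the dataset, the outlier detection step, the multiple linear regression model, and the conclusions.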
The dataset used in this analysis comprises three numerical variables, namely total daily gas consumption Energy \([Smc]\), mean daily external temperature Text \([°C]\), and mean solar irradiance Iext \([W/m^2]\), plus one categorical variable, the day of the week DayOfTheWeek.
The dataset provides daily measurements of these variables for a full heating season in Turin, which runs from \(1^{st}\) November to \(31^{st}\) March, resulting in a total of 151 records.
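The record count follows directly from the season boundaries. A minimal Python sketch, assuming the season is the 2017-11-01 to 2018-03-31 range reported in the dataset summary (the variable names are illustrative and not taken from the original analysis):

```python
from datetime import date

# Season boundaries, assumed from the date range in the dataset summary.
season_start = date(2017, 11, 1)
season_end = date(2018, 3, 31)

# Both endpoints are included in the season, hence the +1.
n_days = (season_end - season_start).days + 1
print(n_days)  # 151
```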
The table below shows a sample of the dataset.
The trend of the variables during the heating season is shown in the figure below.
For the following steps, it is useful to summarize the dataset in terms of descriptive statistics and distributions:
##       date             DayOfTheWeek       Text             Iext
##  Min.   :2017-11-01   Min.   :1.00   Min.   :-5.950   Min.   :  0.50
##  1st Qu.:2017-12-08   1st Qu.:2.00   1st Qu.:-0.115   1st Qu.:  3.48
##  Median :2018-01-15   Median :4.00   Median : 2.920   Median : 34.34
##  Mean   :2018-01-15   Mean   :4.04   Mean   : 3.103   Mean   : 41.23
##  3rd Qu.:2018-02-21   3rd Qu.:6.00   3rd Qu.: 6.605   3rd Qu.: 71.47
##  Max.   :2018-03-31   Max.   :7.00   Max.   :11.610   Max.   :182.10
##      Energy        day_name
##  Min.   :  0.0   Length:151
##  1st Qu.:257.1   Class :character
##  Median :389.2   Mode  :character
##  Mean   :382.2
##  3rd Qu.:556.2
##  Max.   :676.8
In this section, an outlier detection step is applied to identify observations that lie far enough from the bulk of the data to produce incorrect or misleading conclusions when developing a multiple regression model.
One way to do this is to use Cook’s distance, which measures the influence of each observation on a regression analysis. It can be used to identify multivariate outliers in non-normal distributions, like ours, by examining the value of Cook’s distance for each observation. Large values indicate observations that have a disproportionate influence on the regression, which may be because they are outliers.
Cook’s distance is evaluated as:
\[D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p s^2}\] where \(\hat{y}_j\) is the fitted value for observation \(j\) from the model fitted on all observations, \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) from the model fitted without observation \(i\), \(s^2\) is the mean squared error, and \(p\) is the number of estimated regression coefficients (the independent variables plus the intercept).
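To make the definition concrete, the following Python sketch computes Cook's distance by literally refitting the model without each observation in turn. It is an illustration under stated assumptions, not the code behind this report: the differences between fitted values are squared, \(p\) counts the intercept, \(s^2\) is the residual sum of squares divided by \(n - p\), the helper names (`solve`, `ols`, `predict`, `cooks_distance`) are hypothetical, and the data are synthetic, with one outlier planted on purpose.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(X, beta):
    """Fitted values for every row of the design matrix X."""
    return [sum(b * x for b, x in zip(beta, row)) for row in X]


def cooks_distance(X, y):
    """Cook's distance of each observation, by leave-one-out refitting."""
    n, p = len(X), len(X[0])
    fitted = predict(X, ols(X, y))
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without observation i, then predict all n observations.
        beta_i = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        fitted_i = predict(X, beta_i)
        shift = sum((a - b) ** 2 for a, b in zip(fitted, fitted_i))
        distances.append(shift / (p * s2))
    return distances


# Synthetic data (NOT the report's dataset): an exact linear relation on a
# grid of temperature and irradiance values, plus one planted outlier.
X = [[1.0, t, i] for t in (0.0, 3.0, 6.0, 9.0) for i in (10.0, 50.0, 90.0)]
y = [570.0 - 33.0 * t - 0.5 * i for _, t, i in X]
y[0] += 100.0  # planted outlier at index 0

cooks = cooks_distance(X, y)
flagged = [k for k, d in enumerate(cooks) if d > 4 / len(cooks)]
print(cooks.index(max(cooks)))  # 0, the planted outlier
print(flagged)                  # [0]
```

Refitting \(n\) times is the most direct reading of the formula; it is wasteful on large datasets, where the equivalent closed form based on residuals and leverages is normally used instead.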
To better visualize the dataset, a 3D scatter plot is shown in the figure below, with points colored by \(DayOfTheWeek\).
As can be seen, Sundays show no energy consumption, so they can be excluded from the model to improve its accuracy.
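The number of records left after excluding Sundays can be checked from the calendar alone. The Python sketch below assumes the 2017-11-01 to 2018-03-31 season from the dataset summary and identifies Sundays by calendar date rather than by the dataset's DayOfTheWeek column, whose numbering is not stated in the report; the variable names are illustrative.

```python
from datetime import date, timedelta

# Season boundaries, assumed from the date range in the dataset summary.
season_start = date(2017, 11, 1)
season_end = date(2018, 3, 31)
n_days = (season_end - season_start).days + 1
days = [season_start + timedelta(days=k) for k in range(n_days)]

# date.weekday() returns 0 for Monday through 6 for Sunday.
kept_days = [d for d in days if d.weekday() != 6]
n_sundays = len(days) - len(kept_days)
print(n_sundays, len(kept_days))  # 21 130
```

The 130 remaining days match the number of observations used for the Cook's distance threshold.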
Now we can fit the model and evaluate the Cook’s distances:
A common rule of thumb for outlier detection with Cook’s distance is a threshold of \(4/n\), where \(n\) is the number of observations (130 here). Records with a Cook’s distance above \(4/130 \approx 0.031\) are therefore considered outliers and removed from the model to make it more accurate.
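The threshold arithmetic and the flagging rule can be written out as follows. This is a small Python sketch in which the function names and the three example distances are hypothetical; only the \(4/n\) rule and \(n = 130\) come from the text.

```python
def cook_threshold(n_obs):
    """Rule-of-thumb cutoff 4/n for Cook's distance."""
    return 4 / n_obs


def flag_outliers(distances, n_obs):
    """Indices of the observations whose Cook's distance exceeds 4/n."""
    cutoff = cook_threshold(n_obs)
    return [i for i, d in enumerate(distances) if d > cutoff]


threshold = cook_threshold(130)
print(round(threshold, 3))  # 0.031

# Hypothetical distances, for illustration only: just the second value
# exceeds the cutoff for n_obs = 130.
print(flag_outliers([0.002, 0.050, 0.030], n_obs=130))  # [1]
```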
Let’s plot the Cook’s distances and the threshold identified:
As we can see, this metric identifies 4 outliers: 2017-11-13, 2018-01-11, 2018-02-20, and 2018-02-23.
Now we can remove these records and refit the linear regression model, evaluating its performance.
Once the data are cleaned, we can fit a linear regression model using the external temperature \(T_{ext}\) and the irradiance \(I_{ext}\) as predictors (independent variables) of \(Energy\).
##
## Call:
## lm(formula = Energy ~ Text + Iext, data = data_final)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -88.589 -35.919  -9.473  38.182  92.225
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 567.6156     6.4319  88.250  < 2e-16 ***
## Text        -33.2272     0.9738 -34.123  < 2e-16 ***
## Iext         -0.5605     0.1072  -5.228 7.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.35 on 123 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.9076
## F-statistic: 614.6 on 2 and 123 DF, p-value: < 2.2e-16
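The quantities in this summary are linked by simple formulas, which gives a quick way to read it. The Python sketch below uses only numbers copied from the output above and recomputes each t value as estimate divided by standard error, the number of observations as residual degrees of freedom plus number of coefficients, and the adjusted \(R^2\) from \(R^2\). Small differences from the printed values are expected, because the inputs are already rounded; the variable names are illustrative.

```python
# Values copied from the model summary: (estimate, std. error, reported t).
coefficients = {
    "(Intercept)": (567.6156, 6.4319, 88.250),
    "Text": (-33.2272, 0.9738, -34.123),
    "Iext": (-0.5605, 0.1072, -5.228),
}
r_squared = 0.909
residual_df = 123

# Observations used in the fit: residual degrees of freedom + coefficients.
n_params = len(coefficients)
n_obs = residual_df + n_params
print(n_obs)  # 126, i.e. the 130 non-Sunday records minus the 4 outliers

# Each t value is the estimate divided by its standard error.
t_values = {name: est / se for name, (est, se, _) in coefficients.items()}
for name, (_, _, reported) in coefficients.items():
    print(f"{name}: recomputed t = {t_values[name]:.2f}, reported = {reported}")

# Adjusted R-squared penalizes R-squared for the number of coefficients.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df
print(f"adjusted R-squared = {adj_r_squared:.3f}")
```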
As the summary shows, the regression model fits the data well: the multiple \(R^2\) is 0.909 (adjusted \(R^2\) 0.9076), the F-statistic is highly significant (p-value below 2.2e-16), and all coefficients are significant at the 0.001 level.
Furthermore, both coefficients are negative, meaning that energy consumption increases as the external temperature and the irradiance decrease, because free heat gains are smaller. The external temperature accounts for most of the variation in the prediction and has the lower p-value; irradiance is less important but still makes a notable contribution to the model.
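The sign and size of the coefficients are easiest to read through a worked prediction. The Python sketch below plugs the fitted coefficients from the summary into the model equation \(Energy = \beta_0 + \beta_1 T_{ext} + \beta_2 I_{ext}\); the function name and the two example days are hypothetical, and the prediction only makes sense for non-Sunday days within the observed range of the predictors.

```python
# Coefficients copied from the fitted model summary.
INTERCEPT = 567.6156   # Smc per day at Text = 0 °C and Iext = 0 W/m^2
BETA_TEXT = -33.2272   # Smc per day for each additional °C
BETA_IEXT = -0.5605    # Smc per day for each additional W/m^2


def predict_energy(text, iext):
    """Predicted daily gas consumption [Smc] for a non-Sunday day."""
    return INTERCEPT + BETA_TEXT * text + BETA_IEXT * iext


# Two hypothetical days, chosen only to illustrate the coefficient signs.
mild_sunny = predict_energy(text=8.0, iext=100.0)
cold_overcast = predict_energy(text=-2.0, iext=20.0)
print(f"{mild_sunny:.1f} {cold_overcast:.1f}")  # 245.7 622.9
```

With these coefficients, a drop of 1 °C raises the predicted consumption by about 33 Smc per day, while a drop of 10 W/m² in irradiance raises it by about 5.6 Smc per day.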
For completeness, we also show some diagnostic plots used to assess the quality of the regression model.
For example, the Fitted vs Residuals plot (on the left) shows that the residuals are larger around the center of the gas consumption distribution, so the model is less accurate in that range than at the extremes.
In this report, a multiple regression model was developed using energy-related data. The process consisted of a data visualization step, in which the distributions of the variables involved were analyzed; an outlier detection step, intended to make the resulting model more robust; and the development of the multiple regression model, which achieved a high \(R^2\) value and gave an easily interpretable picture of the main drivers of energy consumption in a building.
The model is simple yet effective, reaching a high level of accuracy using only two meteorological variables: temperature and irradiance. Sundays were intentionally omitted in order to improve the model’s robustness without incorporating categorical variables, such as the day of the week, which are harder to handle in regression models.
In our experience, other important variables that can further improve prediction accuracy relate to occupant behavior, and including them can make these models considerably more reliable.